Uncoordinated checkpointing for message-passing systems allows maximum process autonomy and
general nondeterministic execution, but suffers from potential domino effect and the large space overhead
for maintaining checkpoints and message logs. Traditionally, it has been assumed that only obsolete
checkpoints and message logs before the global recovery line can be garbage-collected. Recently, an
approach to identifying all garbage checkpoints based on recovery line transformation and decomposition
has been developed. We show in this paper that the same approach can be applied to the problem of
identifying all garbage message logs for systems requiring message logging to record in-transit messages.
Communication trace-driven simulation for several parallel programs is used to evaluate the proposed
algorithm.
1 Introduction


Checkpointing and rollback recovery is an effective approach to recovering from both hardware and
software errors. During normal execution, the state of each process is periodically saved as a checkpoint
on stable storage, and can be restored after a failure in order to avoid the costly reexecution from the
very beginning. Numerous checkpointing and recovery techniques for message-passing systems have been
proposed in the literature. They can be classified into three primary categories: uncoordinated checkpointing,
coordinated checkpointing and the log-based approach. Uncoordinated checkpointing [1-3] allows maximum
process autonomy and general nondeterministic execution. Each process takes its checkpoints independently
and keeps track of the dependencies among checkpoints resulting from message communications. When a
failure occurs, the dependency information is used to determine the recovery line to which the system should
roll back. The major disadvantages of uncoordinated checkpointing have been the potential domino effect
[4,5], i.e., when cyclic rollback propagation prevents recovery line progression, and the space overhead for
maintaining checkpoints and message logs.

Coordinated checkpointing [6-11] eliminates the domino effect by sacrificing a certain degree of
process autonomy and incurring run-time and message overhead. Processes are required to coordinate their
checkpointing actions in order to guarantee the consistency of corresponding checkpoints and hence the
recovery line progression. In one experiment, it has been shown that the run-time overhead for coordinated
checkpointing can be made reasonably small for a set of benchmark programs if optimization techniques
primarily involving changes to the operating systems can be employed [12]. However, for many practical
applications in which modifying the operating system is not considered a feasible solution, process autonomy
in taking application-level checkpoints is essential for reducing run-time overhead by checkpointing when
the process state is minimal.

The log-based approach [13-21] eliminates the domino effect by assuming a piecewise deterministic
execution model [22] which views process execution as consisting of a number of deterministic state intervals,
each started by a nondeterministic event such as processing a new message. Nondeterministic event logging
in addition to checkpointing is employed to reduce rollback propagation through deterministic state
reconstruction. However, it has been pointed out that the assumption of piecewise determinism may not be valid
for the entire process execution, and hence the support for general nondeterministic execution is important
[23].

This paper mainly considers uncoordinated checkpointing. The recovery line progression problem
is addressed elsewhere [24]. Essentially, a domino-free unifying framework can be built by considering
uncoordinated checkpointing as the basic scheme, and checkpoint coordination (whenever desirable) and
exploiting piecewise determinism (whenever possible) as two mechanisms for bounding rollback propagation.
More specifically, communication-induced checkpoints in a lazy checkpoint coordination scheme [24],
and the logical checkpoints [25] obtained through event logging when piecewise determinism is available
provide additional checkpoints to advance the recovery line.

The main focus of this paper is on the space overhead problem of uncoordinated checkpointing (and
the unifying framework as well). Traditionally, garbage collection has been based on the notion of obsolete
checkpoints and message logs: the global recovery line which suffices to recover from the failure of the
entire system is computed, and all the obsolete checkpoints and message logs before that recovery line are
no longer useful and can be discarded. In contrast, all the non-obsolete checkpoints and message logs have
been assumed to be possibly useful for some future recovery and should be retained. With the possibility of
domino effects, the space overhead may become prohibitively high.

Motivated by the observation that being obsolete is simply a sufficient condition for being garbage, we
previously derived the necessary and sufficient condition for identifying all garbage checkpoints, which leads
to an optimal checkpoint reclamation algorithm [26]. By using the approach of recovery line transformation
and decomposition, we have demonstrated that any non-garbage checkpoint belonging to a possible future
recovery line must also be contained in one of the N “immediate future” recovery lines, where N is the
number of processes. In this paper, we apply the same approach to solving the problem of optimal message
log reclamation¹ for systems requiring message logging to record in-transit, i.e., “sent but not yet received,”
messages. We will show that any non-garbage message log which can become an in-transit message with
respect to a possible future recovery line must also be an in-transit message with respect to one of the N
“immediate future” recovery lines. We wish to stress that the message logs considered in this paper are used
to record the state of the channels [8], and are different from the message logs used for deterministic replay in
the log-based recovery schemes [13, 15]. While both message contents and ordinal positions are important
in the latter, only message contents are needed in the former. In Section 5, we will also demonstrate the
applicability of the optimal garbage collection algorithm to executions that exploit piecewise determinism.

¹ A simple sufficient condition based on local information has been presented to identify some garbage messages before they are
logged [3]; this paper derives the necessary and sufficient condition based on global information for identifying all garbage message
logs.

The paper is organized as follows. Section 2 describes the checkpointing and recovery protocol; Section 3
derives the necessary and sufficient condition for identifying all garbage message logs, based on recovery
line transformation and decomposition; experimental evaluation is described in Section 4; Section 5 extends
our work to a partially-exploited piecewise deterministic model, and Section 6 concludes with a summary.

2 Checkpointing and Recovery Protocol

The system considered in this paper consists of a number of concurrent processes for which all process
communication is through message passing. Processes are assumed to run on fail-stop processors [27]
and, for the purpose of presentation, each process is considered an individual recovery unit. In order to
allow general nondeterministic execution, we do not assume a piecewise deterministic model. This implies
that whenever the sender of a message m rolls back and unsends m, the receiver which has already processed
m must also roll back to undo the effect of m because the potential nondeterminism preceding the sending
of m (Fig. 1(a)) may prevent the same message from being resent during reexecution. Let $c_{i,x}$ denote the
xth checkpoint ($x \ge 0$) of process $p_i$ ($0 \le i \le N-1$), where N is the number of processes in the system.
Two checkpoints $c_{i,x+1}$ and $c_{j,y}$ are then considered inconsistent if there is any message sent after $c_{j,y}$ and
processed before $c_{i,x+1}$, i.e., $c_{j,y}$ happened before $c_{i,x+1}$ [28], or vice versa. In contrast, when the receiver $p_i$
of a message m' rolls back and unreceives m' (Fig. 1(b)), the sender $p_j$ may not need to roll back to unsend
m'. If the acknowledge message for every normal message is treated as an additional dependency-carrying
message, such an in-transit message m' may be retrieved through a reliable end-to-end transmission protocol
[17, 29]. Alternatively, message m' can be retrieved from a synchronous [13, 14] or an asynchronous [3, 15]
message log. Therefore, checkpoints $c_{i,x}$ and $c_{j,y}$ in Fig. 1(b) are considered consistent.

During normal execution, each process periodically and independently takes its checkpoints. The
interval between $c_{i,x}$ and $c_{i,x+1}$ is called the xth checkpoint interval of $p_i$, denoted by $(i, x)$. Each message
is tagged with the process number and the current checkpoint interval number of the sender, and each receiver
$p_i$ performs direct dependency tracking [1, 30] as follows: if a message sent from $(j, y)$ is processed in
$(i, x)$, the direct dependency of $c_{i,x+1}$ on $c_{j,y}$ is recorded.
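The tagging and direct dependency tracking just described can be sketched as follows. This is only a minimal illustration in Python, not the implementation used in the paper: the `Message` and `Process` classes and their field names are hypothetical, and a dependency of $c_{i,x+1}$ on $c_{j,y}$ is assumed to be stored as the pair `((j, y), (i, x + 1))`.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    sender: int        # process number of the sender (the tag j)
    interval: int      # sender's checkpoint interval number at send time (the tag y)
    payload: bytes = b""


@dataclass
class Process:
    pid: int
    interval: int = 0                        # currently executing in interval (pid, interval)
    deps: set = field(default_factory=set)   # recorded direct dependencies

    def take_checkpoint(self):
        # Taking checkpoint c_{pid, interval+1} closes the current interval.
        self.interval += 1

    def send(self, payload=b""):
        # Every outgoing message carries the sender's process number and interval number.
        return Message(self.pid, self.interval, payload)

    def process(self, msg):
        # A message sent from (j, y) and processed in (i, x) makes
        # c_{i,x+1} directly dependent on c_{j,y}.
        self.deps.add(((msg.sender, msg.interval), (self.pid, self.interval + 1)))
```

For instance, if process 0 takes one checkpoint and then sends a message that process 1 processes in its interval 0, process 1 records the pair `((0, 1), (1, 1))`.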

A garbage collection procedure can be periodically invoked by any process $p_i$. First, $p_i$ collects the
direct dependency information from all the other processes to construct the checkpoint graph [1] as shown in
Fig. 2(b). Then the rollback propagation algorithm (Fig. 3) is applied to the checkpoint graph to determine
the global recovery line² (black vertices in Fig. 2(b)), before which all the checkpoints and message logs
are obsolete and can be discarded. When any process initiates a rollback, it starts a similar procedure for
recovery. The current volatile states of the surviving processes are treated as additional virtual checkpoints
[2] for constructing an extended checkpoint graph of which the recovery line is called the local recovery
line (shaded vertices) and indicates the consistent state for the system to roll back to.

3  Optimal  Message  Log  Reclamation 

Since the purpose of message logging is to record in-transit messages needed for rollback recovery, a
message log is non-garbage if and only if it can become an in-transit message with respect to a possible
future recovery line or, for short, intersects³ a possible future recovery line. We model a process execution
as consisting of a number of operational sessions [2] and recovery sessions, where an operational session is
the interval between the start of normal execution and the instance of rollback initiation, and between two
consecutive operational sessions is a recovery session. Since a future process execution may contain any

² The global recovery line is used when the entire system fails; a local recovery line is used when only a subset of processes
becomes faulty.

³ Any message can only intersect a recovery line from the left to the right, as shown in Fig. 1(b), because intersecting in the other
direction contradicts the fact that a recovery line cannot contain any inconsistent checkpoints.
Figure 1: Checkpoint consistency. (a) Inconsistent checkpoints and (b) in-transit message m'
and consistent checkpoints $c_{j,y}$ and $c_{i,x}$. [Figure labels: checkpoint interval (i, x); potential nondeterministic event.]


Figure 2: Checkpointing and rollback recovery. (a) Example checkpoint and communication pattern; (b)
checkpoint graph and extended checkpoint graph when $p_0$ initiates a rollback.


/* CP stands for checkpoint. Initially, all the CPs are unmarked */
include the latest CP of each process in the root set;
mark all CPs strictly reachable [31] from any CP in the root set;
while (at least one CP in the root set is marked) {
    replace each marked CP in the root set by the latest unmarked CP on the same process;
    mark all CPs strictly reachable from any CP in the root set;
}
the root set is the recovery line.


Figure  3:  The  rollback  propagation  algorithm. 
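The pseudocode of Fig. 3 can be turned into a short runnable sketch. The version below is an illustration under stated assumptions rather than the authors' code: the function name `rollback_propagation` and the encoding of a checkpoint as a `(process, index)` pair are hypothetical, an edge is assumed to link each checkpoint to the next one on the same process, and "strictly reachable" is taken to mean reachable by a path of at least one edge. It also assumes that the initial checkpoint of each process has no incoming message edge, so the loop always terminates.

```python
from collections import defaultdict


def rollback_propagation(num_ckpts, msg_edges):
    """Return the recovery line: entry i is the checkpoint index chosen for process i.

    num_ckpts[i] -- number of checkpoints of process i (indices 0 .. num_ckpts[i] - 1)
    msg_edges    -- iterable of ((j, y), (i, x)) edges induced by messages
    """
    succ = defaultdict(set)
    for i, n in enumerate(num_ckpts):
        for x in range(n - 1):
            succ[(i, x)].add((i, x + 1))      # assumed edge between consecutive checkpoints
    for a, b in msg_edges:
        succ[a].add(b)

    marked = set()

    def mark_reachable(v):
        # Mark every vertex reachable from v by a path of at least one edge.
        stack = list(succ[v])
        while stack:
            w = stack.pop()
            if w not in marked:
                marked.add(w)
                stack.extend(succ[w])

    root = [n - 1 for n in num_ckpts]         # latest checkpoint of each process
    for i, x in enumerate(root):
        mark_reachable((i, x))
    while any((i, x) in marked for i, x in enumerate(root)):
        for i in range(len(root)):
            while (i, root[i]) in marked:     # fall back to the latest unmarked checkpoint
                root[i] -= 1
            assert root[i] >= 0, "the initial checkpoint must never be marked"
        for i, x in enumerate(root):
            mark_reachable((i, x))
    return root
```

Because each vertex is expanded at most once, this sketch visits every edge a bounded number of times, in the spirit of the linear-time behavior discussed for the algorithm. With two processes of three checkpoints each and the message edges `((0, 2), (1, 2))` and `((1, 1), (0, 2))`, the rollback propagates on both processes and the returned line is `[1, 1]`.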
number of arbitrary operational sessions and recovery sessions, there are an infinite number of possible future
recovery lines. We first describe the recovery line transformation and decomposition which can provide a
finite set of “immediate future” recovery lines sufficient for representing the infinite future possibilities for
the purpose of garbage collection.

3.1 Recovery Line Transformation and Decomposition

Based on the previous description of checkpoint consistency, we define a consistent global checkpoint as a
set of N checkpoints, one from each process and no two of which are related through the happened before
relation. A recovery line refers to the “latest available” consistent global checkpoint.

Since a checkpoint graph represents program dependency, vertices must be added to and removed from
the graph according to a specific set of rules. In an operational session, new vertices are added to the
checkpoint graph and cannot have any outgoing edges to any existing vertices⁴. (If a graph G' can be
obtained by adding new vertices to another graph G in this way, G' is called a potential supergraph of G.) In
a recovery session, existing vertices after the local recovery line are removed together from the checkpoint
graph. The above rules for checkpoint graph evolution then determine the possible future checkpoint graphs,
and therefore the possible future recovery lines.

We first define a set of $2^N$ immediate supergraphs which are the supergraphs of G and the subgraphs of $\hat{G}$
as shown in Fig. 4. $\hat{G}$ is constructed by adding a new-node $n_i$ with a single incoming edge at the end for each
process $p_i$. Let $U$ denote the set of all such new-nodes and $\mathcal{RL}(G)$ denote the recovery line of a checkpoint
graph G. Given any possible future recovery line $\mathcal{RL}(G')$ of G, we first apply recovery line transformation
to $\mathcal{RL}(G')$ to transform it backwards in time into one of the $2^N$ recovery lines $\mathcal{RL}(\hat{G} - W)$, $W \subseteq U$,
followed by a recovery line decomposition to express the latter recovery line in terms of the N recovery
lines $\mathcal{RL}(\hat{G} - n_i)$, $0 \le i \le N-1$. We will demonstrate that since recovery line transformation and
decomposition preserve all non-garbage checkpoints and message logs of any possible future recovery line,
the above N recovery lines suffice for the purpose of optimal garbage collection.

The example shown in Fig. 5 will be used to illustrate the transformation and decomposition throughout
this paper. (Formal proofs can be found elsewhere [26].) Suppose G in Fig. 5(a) is the current checkpoint
graph considered for garbage collection. Fig. 5(b) shows the extended checkpoint graph when $p_3$ later
initiates the first rollback, and $G_c$ is the checkpoint graph immediately after the recovery. Fig. 5(d) shows
another possible extended checkpoint graph when $p_0$ initiates a second rollback. We now describe how to
transform and decompose $\mathcal{RL}(G_d)$, a possible future recovery line of G.

⁴ Vertices with incoming edges from not-yet-collected vertices are temporarily excluded from the checkpoint graph.

Figure 4: The immediate supergraphs.

Recovery Line Transformation: Since any future process execution can be viewed as consisting of a number
of operational sessions and recovery sessions, our approach is to define two elementary transformations:
transformation within an operational session and transformation across a recovery session. Any possible
future recovery line can then be transformed by repeatedly and alternately applying the two transformations.

Transformation within an operational session: First we consider $G_c$ and $G_d$, where $G_c$ is at the
beginning of a new operational session and $G_d$ is a potential supergraph of $G_c$. Given the recovery
line $\mathcal{RL}(G_d)$, we replace the checkpoints X, Y and Z which are not in $G_c$ with their corresponding
new-nodes P, Q and R of $G_c$, as shown in Fig. 5(g). $\mathcal{RL}(G_d) = \{A, B, X, Y, Z\}$ is then transformed into
$\mathcal{RL}(G_g) = \{A, B, P, Q, R\}$, where $G_g$ is an immediate supergraph of $G_c$.

Transformation across a recovery session: We next consider $G_g$ and $G_h$, which is the checkpoint
graph at the end of the first operational session (without the virtual checkpoints). Given the recovery line
$\mathcal{RL}(G_g)$, we replace the two new-nodes Q and R which are contributed by the rolled-back processes in the
first recovery with their corresponding checkpoints on the local recovery line, namely, C and D. $\mathcal{RL}(G_g)$
is then transformed into $\mathcal{RL}(G_f) = \{A, B, P, C, D\}$, where $G_f$ is recognized as an immediate supergraph
of $G_h$.

Finally, since $G_f$ is a potential supergraph of G, $\mathcal{RL}(G_f)$ can be transformed into $\mathcal{RL}(G_e) =
\{A, B, n_2, C, D\}$. The possible future recovery line $\mathcal{RL}(G_d)$ is then transformed into the recovery line of
an immediate supergraph $G_e$ of G.

Recovery Line Decomposition: Let $\min(S)$ denote the set of minimal elements, i.e., vertices without any
incoming edges, of $S$. The recovery line decomposition expresses each of the $2^N$ recovery lines in terms of
the N recovery lines $\mathcal{RL}(\hat{G} - n_i)$, $0 \le i \le N-1$, as follows.

$$\mathcal{RL}(\hat{G} - W) = \min\Bigl(\bigcup_{n_i \in W} \mathcal{RL}(\hat{G} - n_i)\Bigr). \tag{1}$$

For example, the recovery line of $G_e = \hat{G} - \{n_0, n_1, n_3, n_4\}$ in Fig. 5(e) has the following decomposition
(referring to Fig. 6):

$$\mathcal{RL}(G_e) = \min\bigl(\mathcal{RL}(\hat{G} - n_0) \cup \mathcal{RL}(\hat{G} - n_1) \cup \mathcal{RL}(\hat{G} - n_3) \cup \mathcal{RL}(\hat{G} - n_4)\bigr)$$

$$= \min(\{A, B, n_2, n_3, n_4, n_0, I, n_1, J, C, D\}) = \{A, B, n_2, C, D\}.$$

3.2 Message Log Reclamation

Based on the approach of recovery line transformation and decomposition, it has been shown that the union
of the N recovery lines $\mathcal{RL}(\hat{G} - n_i)$, $0 \le i \le N-1$, contains all the non-garbage checkpoints [26]. For
the example shown in Fig. 6, while all the checkpoints in G are non-obsolete, only the shaded checkpoints
in Fig. 6(f) are non-garbage and need to be retained. We next demonstrate that the set of in-transit messages
with respect to the N recovery lines contains all the non-garbage messages.

Instead of considering each individual message, we use its corresponding edge in the checkpoint graph
for our discussion. Let (a, b) denote the directed edge from vertex a to vertex b. By definition, (a, b)
intersects a recovery line $\mathcal{RL}(G)$ if a is on the left hand side of $\mathcal{RL}(G)$ and b is on the right hand side of
$\mathcal{RL}(G)$.
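The intersection test can be written down directly once a representation is fixed. The sketch below is only an illustration under assumptions that are not part of the paper: a checkpoint is encoded as a `(process, index)` pair, a recovery line is a list whose entry k is the index of the line's checkpoint on process k, and "left of" and "right of" the line are taken to mean a strictly smaller and a strictly larger index on the same process. The function name `intersects` is hypothetical.

```python
def intersects(edge, line):
    """Return True if the edge (a, b) crosses the recovery line from left to right.

    edge -- ((j, y), (i, x)): a is checkpoint y of process j, b is checkpoint x of process i
    line -- line[k] is the index of the recovery-line checkpoint of process k
    """
    (j, y), (i, x) = edge
    a_is_left = y < line[j]     # the message was sent before the sender's line checkpoint
    b_is_right = x > line[i]    # the message was processed after the receiver's line checkpoint
    return a_is_left and b_is_right
```

Under this encoding, the edge `((0, 1), (1, 2))` intersects the line `[2, 1]`, but not `[1, 1]` (a lies on the line, so the send is undone) and not `[2, 2]` (b lies on the line, so the receipt is already recorded in the checkpoint).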
Figure 6: Example execution of our algorithm.
PROPERTY 1 Given a checkpoint graph G and one of its edges (a, b), if (a, b) intersects a possible future
recovery line, (a, b) must intersect $\mathcal{RL}(\hat{G} - W)$ for some $W \subseteq U$.

Sketch of the proof. Again, we use the example in Fig. 5. In particular, since the edge (E, F) in G
intersects a possible future recovery line $\mathcal{RL}(G_d)$, we want to show that (E, F) must also intersect $\mathcal{RL}(G_e)$
where $G_e = \hat{G} - W$, $W = \{n_0, n_1, n_3, n_4\}$.

Transformation within an operational session: First, we consider the relative position of any remaining
checkpoint of G to the recovery lines $\mathcal{RL}(G_d)$ and $\mathcal{RL}(G_g)$. Any such checkpoint which is on the
left (right) hand side of $\mathcal{RL}(G_d)$ must remain on the left (right) hand side of $\mathcal{RL}(G_g)$. Therefore, any
edge of G intersecting $\mathcal{RL}(G_d)$, for example (E, F), must also intersect $\mathcal{RL}(G_g)$ after the recovery line
transformation.

Transformation across a recovery session: We next consider $\mathcal{RL}(G_g)$ and $\mathcal{RL}(G_f)$. Any remaining
vertex of G which is on the right hand side of $\mathcal{RL}(G_g)$ must remain on the right hand side of $\mathcal{RL}(G_f)$; those
on the left hand side of $\mathcal{RL}(G_g)$ remain on the left hand side of $\mathcal{RL}(G_f)$ except for C and D. Therefore,
any remaining edge (a, b) of G intersecting $\mathcal{RL}(G_g)$ must also intersect $\mathcal{RL}(G_f)$ except for the possible
outgoing edges of C and D. But since C and D are on the local recovery line, all their outgoing edges must
have been removed during the rollback; so any such (a, b) must also intersect $\mathcal{RL}(G_f)$. Again, (E, F)
serves as such an example.

Finally, we can show that (E, F) also intersects $\mathcal{RL}(G_e)$ by again applying the transformation within
an operational session. □

Before  applying  the  recovery  line  decomposition,  we  first  express  Eq.  (1)  in  another  form  which  is  more 
convenient  for  considering  the  relative  position  of  a  checkpoint  to  a  recovery  line. 

LEMMA 1 $\min\bigl(\bigcup_{n_i \in W} \mathcal{RL}(\hat{G} - n_i)\bigr)$ in Eq. (1) consists of the leftmost checkpoint of each process in the
union.

Proof. If a checkpoint v of $p_i$ is not the leftmost checkpoint of $p_i$ in the union, then v cannot be a
minimal element because there exists at least one checkpoint on its left. Conversely, if v is the leftmost
checkpoint of $p_i$, v must be in $\min\bigl(\bigcup_{n_i \in W} \mathcal{RL}(\hat{G} - n_i)\bigr)$ because there are only N leftmost checkpoints
and $\mathcal{RL}(\hat{G} - W) = \min\bigl(\bigcup_{n_i \in W} \mathcal{RL}(\hat{G} - n_i)\bigr)$ must contain N checkpoints. □
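Lemma 1 means that the decomposition of Eq. (1) can be evaluated process by process, taking the leftmost checkpoint each time. The following sketch illustrates this under an assumed encoding that does not come from the paper: each recovery line is a list of per-process checkpoint indices, and the new-node $n_k$ is given an index one larger than that of the last real checkpoint of process k, so that it sorts to the right of every real checkpoint. The name `decompose` and its arguments are hypothetical.

```python
def decompose(lines, removed):
    """Evaluate Eq. (1) using Lemma 1.

    lines   -- lines[i] is the recovery line of (G-hat minus n_i), as per-process indices
    removed -- non-empty collection of process numbers i whose new-node n_i belongs to W
    Returns the recovery line of (G-hat minus W), again as per-process indices.
    """
    if not removed:
        raise ValueError("W must contain at least one new-node")
    num_procs = len(lines)
    # For every process k, keep the leftmost (smallest-index) checkpoint found in the union.
    return [min(lines[i][k] for i in removed) for k in range(num_procs)]
```

As a small hand-worked case with two processes of three checkpoints each (so index 3 stands for a new-node), take `lines = [[1, 1], [3, 2]]`. Removing both new-nodes gives `decompose(lines, {0, 1}) == [1, 1]`, and removing only $n_1$ returns the second line unchanged.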

PROPERTY 2 Given a checkpoint graph G and one of its edges (a, b), if (a, b) intersects $\mathcal{RL}(\hat{G} - W)$
for some $W \subseteq U$, (a, b) must intersect $\mathcal{RL}(\hat{G} - n_i)$ for some $0 \le i \le N-1$.

Proof. We will prove the property by showing that if (a, b) does not intersect any $\mathcal{RL}(\hat{G} - n_i)$, (a, b)
cannot intersect any $\mathcal{RL}(\hat{G} - W)$.

Suppose (a, b) does not intersect any $\mathcal{RL}(\hat{G} - n_i)$. Then, each $\mathcal{RL}(\hat{G} - n_i)$ must lie either entirely on
the right hand side of (a, b) or entirely on the left hand side of (a, b).

Recovery line decomposition: Given any $\mathcal{RL}(\hat{G} - W)$, $W \subseteq U$, if all $\mathcal{RL}(\hat{G} - n_i)$'s, $n_i \in W$, are entirely
on the right hand side of (a, b), $\mathcal{RL}(\hat{G} - W)$ must also lie on the right hand side of (a, b) by Eq. (1) and
Lemma 1; if at least one $\mathcal{RL}(\hat{G} - n_i)$, $n_i \in W$, lies entirely on the left hand side of (a, b), $\mathcal{RL}(\hat{G} - W)$ will
be on the left hand side of (a, b) again by Lemma 1. In either case, (a, b) cannot intersect $\mathcal{RL}(\hat{G} - W)$. □

We are now prepared to prove the major result of this paper: the necessary and sufficient condition for
a message log to be non-garbage.

THEOREM 1 A message log with its corresponding edge contained in G is non-garbage if and only if the
edge intersects $\mathcal{RL}(\hat{G} - n_i)$ for some $0 \le i \le N-1$.

Proof. Any non-garbage message log must have its corresponding edge (a, b) intersecting a possible
future recovery line. From Property 1, (a, b) must intersect $\mathcal{RL}(\hat{G} - W)$ for some $W \subseteq U$. From Property 2,
(a, b) must intersect $\mathcal{RL}(\hat{G} - n_i)$ for some $0 \le i \le N-1$. The if part comes from the fact that every
$\mathcal{RL}(\hat{G} - n_i)$ is a possible future recovery line. □

Theorem 1 also leads to an optimal message log reclamation algorithm for finding all non-garbage message
logs: first compute the N recovery lines $\mathcal{RL}(\hat{G} - n_i)$, $0 \le i \le N-1$; only those message logs with their
corresponding edges intersecting any of the N recovery lines are non-garbage. In Fig. 6, the edge (E, F)
intersects $\mathcal{RL}(\hat{G} - n_0)$, (G, H) intersects $\mathcal{RL}(\hat{G} - n_4)$ and none of the edges intersects $\mathcal{RL}(\hat{G} - n_1)$,
$\mathcal{RL}(\hat{G} - n_2)$ or $\mathcal{RL}(\hat{G} - n_3)$. Therefore, while all the edges in Fig. 6(f) are non-obsolete, only those
message logs corresponding to (E, F) and (G, H) need to be retained.
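The reclamation procedure stated above can be sketched end to end as follows. This is an illustrative reading of the algorithm, not the authors' implementation, and it rests on several assumptions: checkpoints are `(process, index)` pairs, consecutive checkpoints of a process are linked by an edge, the new-node $n_k$ is modeled as one extra checkpoint appended to process k (its single incoming edge is assumed to come from the latest checkpoint of $p_k$), and an edge intersects a line when its source is strictly left and its target strictly right of the line. All function names are hypothetical.

```python
from collections import defaultdict


def recovery_line(num_ckpts, msg_edges):
    """Rollback propagation (Fig. 3); returns per-process checkpoint indices."""
    succ = defaultdict(set)
    for i, n in enumerate(num_ckpts):
        for x in range(n - 1):
            succ[(i, x)].add((i, x + 1))
    for a, b in msg_edges:
        succ[a].add(b)
    marked = set()

    def mark_reachable(v):
        stack = list(succ[v])
        while stack:
            w = stack.pop()
            if w not in marked:
                marked.add(w)
                stack.extend(succ[w])

    root = [n - 1 for n in num_ckpts]
    for i, x in enumerate(root):
        mark_reachable((i, x))
    while any((i, x) in marked for i, x in enumerate(root)):
        for i in range(len(root)):
            while (i, root[i]) in marked:
                root[i] -= 1
            assert root[i] >= 0, "the initial checkpoint must never be marked"
        for i, x in enumerate(root):
            mark_reachable((i, x))
    return root


def intersects(edge, line):
    (j, y), (i, x) = edge
    return y < line[j] and x > line[i]


def non_garbage_logs(num_ckpts, msg_edges):
    """Return the set of message edges whose logs must be retained (Theorem 1)."""
    edges = set(msg_edges)
    keep = set()
    for i in range(len(num_ckpts)):
        # G-hat minus n_i: every process except p_i receives one appended new-node.
        extended = [n + (k != i) for k, n in enumerate(num_ckpts)]
        line = recovery_line(extended, edges)
        keep |= {e for e in edges if intersects(e, line)}
    return keep
```

In a hand-worked two-process case with three checkpoints each and edges `((0, 0), (1, 2))`, `((0, 2), (1, 2))` and `((1, 1), (0, 2))`, the two lines are `[1, 1]` and `[3, 2]`, and only the log for `((0, 0), (1, 2))` is retained. The other two edges lie after the global recovery line `[1, 1]` and so are non-obsolete, yet they are garbage under Theorem 1.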

The complexity of the algorithm is analyzed as follows. The rollback propagation algorithm in Fig. 3
is of complexity $O(|E|)$, where $|E|$ is the number of edges, because every edge marked by the algorithm
can be removed. The remaining incoming edges of those checkpoints on the right hand side of the recovery
line then give the set of edges intersecting the recovery line. Since the complexity of scanning through the
above set of checkpoints is no greater than $O(|E|)$, the complexity remains $O(|E|)$. Our optimal garbage
collection algorithm involves executing the rollback propagation algorithm on N checkpoint graphs and is
therefore of complexity $O(N|E|)$.

4  Experimental  Evaluation 

Three hypercube programs are used to illustrate the message log reclamation capabilities and benefits of
our algorithm. They are Cell placement, Channel router and QR decomposition, running on an 8-node Intel
iPSC/2 hypercube. Communication traces are collected by intercepting the send and receive system
calls. Communication trace-driven simulation is then performed to obtain the results. The execution time
for each program is listed in Table 1. The checkpoint interval is chosen to be approximately one tenth of
the execution time.


Table 1: Execution time and checkpoint interval.

Programs                    Cell placement   Channel router   QR decomposition
Execution time (sec)        324              469              370
Checkpoint interval (sec)   35               40               35

Fig. 7 compares our algorithm with the traditional algorithm for the three programs in terms of the
number of retained message logs. Each curve shows the remaining space overhead after garbage collection
if the algorithm is invoked after a certain number of checkpoints have been taken. Since the checkpointing
clocks on all nodes are approximately synchronized, checkpoints #8n through #8(n+1)-1 are taken at about
the same time, which explains the fact that the number of messages is almost constant within that interval.

Figure 7: Message log reclamation results for the three parallel programs. [Plots omitted; each panel shows
the number of retained message logs (vertical axis) against the number of checkpoints taken (horizontal axis).]

The domino effect is illustrated by the constant increase in the number of non-obsolete message logs as
the total number of checkpoints increases, for example, between checkpoints #40 and #64 in Fig. 7(a) and
between checkpoints #48 and #88 in Fig. 7(b). The figure shows that our algorithm performs consistently
better than the traditional algorithm.

5  Exploiting  Piecewise  Determinism 

Instead of viewing the piecewise deterministic (PWD) model as a constraint imposed upon the program
behavior, we consider exploiting piecewise determinism, whenever possible and desirable, as a mechanism
for bounding rollback propagation in an uncoordinated checkpointing protocol. The notion of logical
checkpoints [25] has been proposed to provide a unified dependency model for both PWD and non-PWD
scenarios. Essentially, referring to Fig. 8(b), the physical checkpoint $c_0$, the message log of $m_0$ (including
both the message content and ordinal position) and the underlying PWD model equivalently place a logical
checkpoint $L_0$ at the end of the state interval initiated by $m_0$ because of the capability of deterministic
state reconstruction up to that point. Fig. 8(b) shows a situation where the PWD model is valid throughout
the execution and so message logging can always be employed to insert additional logical checkpoints to
effectively advance the recovery line, as compared with (a).

In practice, the PWD model may not be valid throughout the entire program execution; for example, it
may not be appropriate to “replay” input events such as real-time clock readings and resource status. When
the PWD model becomes invalid, it can only be resumed after the next physical checkpoint is taken because
the deterministic state reconstruction for the current checkpoint interval has been interrupted. Fig. 8(c) illustrates
a situation where the PWD model can only be partially exploited. Suppose the piecewise determinism is not
available for the parts of execution indicated by the shaded bars. Then the logical checkpoints belonging to
those regions are no longer available. Fig. 9(a) shows the corresponding checkpoint graphs including all the
physical and logical checkpoints; Fig. 9(b)-(f) apply the optimal garbage collection algorithm to the above
checkpoint graph. Note that the logical checkpoint $L_0$ in Fig. 9(b) being non-garbage implies that both the
physical checkpoint $c_0$ and the message log of $m_0$ (Fig. 8(c)) are non-garbage, and $m_0$ must be the first new
message to be processed after $p_0$ restarts from $c_0$. In contrast, Fig. 9(f) identifies the two thick edges as
non-garbage edges, which means the contents of the message logs of $m_1$ and $m_2$ (Fig. 8(c)) are non-garbage
but the ordinal position information can be discarded.

Figure 8: Piecewise determinism and the availability of logical checkpoints. (The shaded bars in (c) indicate
those parts of process execution which do not satisfy the PWD model.)

6  Summary 

For systems requiring message logging to record in-transit messages, we have derived the necessary
and sufficient condition for identifying all garbage message logs, which leads to an optimal message log
reclamation algorithm. Combining it with a previous optimal checkpoint reclamation algorithm, we have
developed an optimal garbage collection algorithm for minimizing the space overhead of uncoordinated
checkpointing. The overall complexity of the algorithm is $O(N|E|)$ where N is the number of processes
and $|E|$ is the number of edges in the checkpoint graph. Communication trace-driven simulation results for
three parallel programs showed that the algorithm can be effective in reducing the space overhead for real
applications.